Support determining mode based on shebang interpreter directive #47

iainh · 2022-07-18T01:14:59Z

Attempt to determine the appropriate mode based on the presence of an interpreter directive on the first line of the file. When determining the mode, interpreter directives are examined first, followed by the filename, mirroring what Emacs does by default.

Reopening now that highlight-with-queries has been merged. Feel free to close if there is no interest in supporting this feature.

mcobzarenco

This is really awesome! Thanks so much, it's a very useful addition I wished I had before.

Sorry for the multiple comments and for being pedantic about not loading the whole file. Happy to help with any of them.

mcobzarenco · 2022-07-18T21:33:41Z

zee-grammar/src/config.rs

@@ -17,6 +17,8 @@ pub struct ModeConfig {
    pub comment: Option<CommentConfig>,
    pub indentation: IndentationConfig,
    pub grammar: Option<GrammarConfig>,
+    #[serde(default)]
+    pub shebangs: Vec<String>,


I was wondering whether this should be a new field rather than a new variant of FilenamePattern. Granted, we should probably rename it to FilePattern, as it would also look inside the file to determine whether the mode applies, i.e. something like FilePattern::Shebang.

The structure I suggest may make it harder to avoid reading the file when the filename would suffice.
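To make the trade-off concrete, here is a minimal sketch of what a unified pattern type could look like. Only the names `FilePattern` and `FilePattern::Shebang` come from the comment above; the `Suffix` and `Name` variants, the `needs_content` helper, and the matching rule are assumptions for illustration and are not zee's actual definitions.

```rust
use std::path::Path;

/// Hypothetical unified pattern type. `Suffix` and `Name` are placeholder
/// variants invented for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum FilePattern {
    /// Matches when the file name ends with the given suffix, e.g. ".py".
    Suffix(String),
    /// Matches when the file name equals the given name, e.g. "Makefile".
    Name(String),
    /// Matches when the interpreter directive names the given interpreter.
    Shebang(String),
}

impl FilePattern {
    /// Only shebang patterns need to look inside the file.
    pub fn needs_content(&self) -> bool {
        matches!(self, FilePattern::Shebang(_))
    }

    /// `first_line` is `None` when the caller chose not to read the file.
    pub fn matches(&self, path: &Path, first_line: Option<&str>) -> bool {
        let file_name = path.file_name().and_then(|name| name.to_str());
        match self {
            FilePattern::Suffix(suffix) => {
                file_name.map_or(false, |name| name.ends_with(suffix.as_str()))
            }
            FilePattern::Name(expected) => file_name == Some(expected.as_str()),
            FilePattern::Shebang(interpreter) => first_line.map_or(false, |line| {
                // "#!" is two ASCII bytes, so slicing at 2 is safe here.
                line.starts_with("#!")
                    && line[2..]
                        .split(|c: char| c == '/' || c.is_whitespace())
                        .any(|token| token == interpreter.as_str())
            }),
        }
    }
}
```

A caller could check `needs_content()` across the configured patterns to decide whether any file content has to be read at all, which is the part this structure makes awkward.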

mcobzarenco · 2022-07-18T21:45:28Z

zee/src/editor/buffer.rs

+        let mode = text
+            .line(0)
+            .as_str()
+            .and_then(|shebang| context.0.mode_by_shebang(shebang))
+            .or_else(|| {
+                file_path
+                    .as_ref()
+                    .and_then(|path| context.0.mode_by_filename(path))
+            })


This may read the whole file in pathological cases, e.g. a minified file that has no newlines. One goal of zee is to stay fast for any kind of pathological file you can think of and to do anything that is potentially blocking in the background (e.g. parsing syntax or writing the file to disk). I've also been trying to avoid doing anything linear in the length of a line in the UI thread.

I think the right solution here long term is to build buffers, i.e. call Buffer::new() in a background thread, rather than in the main, UI thread.

For this PR though, I'd be happy if instead you just bound how much of the file you read, say 256 bytes at most and test the regex for that. You'll have to deal with potentially truncated utf-8...

Maybe a better solution is to read characters until you encounter a newline; if you don't find one after X characters, you give up and don't check the shebang. Essentially, we only test whether shebangs apply up to a certain line length.
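A minimal sketch of that bounded scan, assuming the text is available as an iterator of `char`s (which sidesteps the truncated UTF-8 problem). The function name `bounded_first_line` and the constant are hypothetical and do not appear in the PR.

```rust
/// Hypothetical limit; 256 is the figure floated earlier in this review.
pub const MAX_SHEBANG_CHARS: usize = 256;

/// Returns the first line if it ends (by newline or end of input) within the
/// first `max_chars` characters, and `None` if the limit is reached first.
pub fn bounded_first_line(
    mut chars: impl Iterator<Item = char>,
    max_chars: usize,
) -> Option<String> {
    let mut line = String::new();
    for _ in 0..max_chars {
        match chars.next() {
            // A newline or the end of the input terminates the line.
            Some('\n') | None => return Some(line),
            Some(character) => line.push(character),
        }
    }
    // No line ending within the limit: give up without reading further.
    None
}
```

The work done is bounded by `max_chars` regardless of how long the first line is, so a minified file with no newlines costs at most 256 iterations.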

A second comment: I think we should test whether "mode_by_filename" applies first and only then check the shebangs, to avoid having to read the file if the name already matches.

Great suggestions. I'm looking to see if there are any standards around whether whitespace is allowed before the #! and what the maximum length is on most platforms. I have some outdated information that 127-512 bytes is the maximum, but with no standard to point at, I think an overly large maximum might be best. FreeBSD, for example, historically supported 4096. If whitespace before the shebang is not allowed, then a two-pass strategy, where only the first two characters are examined, followed by the remainder of the line up to the maximum discussed, might be the most efficient.
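The two-pass idea might look roughly like the sketch below. It assumes that whitespace before `#!` is not allowed (the open question above) and that the first bytes of the file are already in memory; `shebang_line` is a hypothetical name.

```rust
/// Returns the interpreter directive line, or `None` if the input does not
/// start with "#!" or the line does not end within `max_len` bytes.
pub fn shebang_line(bytes: &[u8], max_len: usize) -> Option<&str> {
    // Pass 1: look at the first two bytes only. Most files stop here.
    if !bytes.starts_with(b"#!") {
        return None;
    }
    // Pass 2: scan at most `max_len` bytes for the end of the line.
    let window = &bytes[..bytes.len().min(max_len)];
    let line = match window.iter().position(|&byte| byte == b'\n') {
        Some(end) => &window[..end],
        // No newline, but the whole input fits in the window.
        None if window.len() == bytes.len() => window,
        // No newline within the limit: treat as "no shebang".
        None => return None,
    };
    // The slice ends at a newline or at end of input, never at an arbitrary
    // cut-off, so a valid UTF-8 file cannot be split mid-character here.
    std::str::from_utf8(line).ok()
}
```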

mcobzarenco · 2022-07-18T21:51:09Z

zee/src/editor/mod.rs

-    pub fn mode_by_filename(&self, filename: impl AsRef<Path>) -> &Mode {
+    pub fn mode_by_filename(&self, filename: impl AsRef<Path>) -> Option<&Mode> {
        self.modes
            .iter()
            .find(|&mode| mode.matches_by_filename(filename.as_ref()))
-            .unwrap_or(&PLAIN_TEXT_MODE)
+    }
+
+    pub fn mode_by_shebang(&self, shebang: &str) -> Option<&Mode> {
+        self.modes
+            .iter()
+            .find(|&mode| mode.matches_by_shebang(shebang))


If shebang is a variant of FilenamePattern, there would be one function and we could continue to return &Mode rather than Option<&Mode> -- I would like to continue to have Context own coming up with a Mode for any possible file whatsoever and not have a default PLAIN_TEXT_MODE potentially duplicated.

I've been thinking about merging these methods into something like mode_by_file_pattern that would always return a &Mode but I'm stuck on how to handle the difference in parameters. Two of the variants operate on the filename while the other needs a portion of the file content. To have one method we would have to always pass a portion of the file content to the function in addition to the filename which would require reading at least part of the content of every file.

All of that being said, prior to detecting the mode or creating the buffer, open_file() is calling Rope::from_reader(), which I think might be reading the entire file anyway.

zee/zee/src/editor/mod.rs

Line 156 in 0b40784

Rope::from_reader(BufReader::new(File::open(&file_path)?))?,

Am I reading that right? If so, we already have the Rope created and available in Buffer::new(), so reading a slice should be quite efficient.

Here is an updated version of this code with a limit on the length of text examined for the interpreter directive, an inversion of the order mode checks are performed (filename checks first, then shebang), and a merging of the mode_by_filename() and mode_by_shebang() methods into one, mode_by_file(), which always returns a &Mode.

I ended up settling on 256 characters for the maximum length of the shebang directive, matching what Linux has done since 2018 (https://lore.kernel.org/lkml/20181112160956.GA28472@redhat.com/) (https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h).

I haven’t found a satisfactory way of adding a shebang case to FilenamePattern, the biggest stumbling blocks being that the shebang line would need to be passed to the matches() method and the pattern list would need to be sorted/filtered if we wanted to ensure that filename patterns were always examined before shebang directives.

1) Constrain the number of characters examined when determining the interpreter directive to 256 to avoid the performance penalty of scanning long lines unnecessarily. 2) Move to a single mode_by_file() function in Context. 3) Always attempt to match the mode based on the filename first, falling back to the shebang line only if no match is found.
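As an illustration of points 2 and 3, the following sketch shows one way a single lookup could try the filename first, fall back to the shebang, and still always return a `&Mode`. It is not the code from this PR: the `Mode` fields, the substring-based matching, and the lazy `first_line` closure are all assumptions made so the example is self-contained.

```rust
use std::path::Path;

/// Simplified stand-in for zee's mode type (fields are assumptions).
#[derive(Debug, PartialEq)]
pub struct Mode {
    pub name: String,
    pub suffixes: Vec<String>,
    pub shebangs: Vec<String>,
}

impl Mode {
    fn matches_by_filename(&self, path: &Path) -> bool {
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
        self.suffixes.iter().any(|suffix| name.ends_with(suffix.as_str()))
    }

    fn matches_by_shebang(&self, first_line: &str) -> bool {
        first_line.starts_with("#!")
            && self.shebangs.iter().any(|s| first_line.contains(s.as_str()))
    }
}

pub struct Context {
    pub modes: Vec<Mode>,
    /// The single fallback mode, owned by the context.
    pub plain_text: Mode,
}

impl Context {
    /// Filename patterns are tried first; `first_line` is only invoked when
    /// no filename pattern matched, so file content is read lazily.
    pub fn mode_by_file(
        &self,
        path: Option<&Path>,
        first_line: impl FnOnce() -> Option<String>,
    ) -> &Mode {
        path.and_then(|p| self.modes.iter().find(|mode| mode.matches_by_filename(p)))
            .or_else(|| {
                let line = first_line()?;
                self.modes.iter().find(|mode| mode.matches_by_shebang(&line))
            })
            .unwrap_or(&self.plain_text)
    }
}
```

Passing the first line as a closure is one possible answer to the "difference in parameters" concern raised earlier: callers always supply a way to obtain the content, but it is only evaluated when the filename does not already decide the mode.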

zee/config/config.ron

mcobzarenco reviewed Jul 18, 2022

iainh added 2 commits July 20, 2022 22:26

Build fix

7936db6

kevinmatthes reviewed Jul 24, 2022
